Replication Tutorial

Lars Vilhuber
April 2019

Cornell University

Overview

  • High-level overview (15:00)
  • Details of Reproducibility Checks (15:00)
  • A concrete example

Replication and Reproducibility in Social Sciences and Statistics: Context, Concerns, and Concrete Measures

Paris presentation

DOI

Details of Reproducibility Checks

Verification guidance

A concrete example

We are going to review a fully reproducible example:

  • Step 1: elements of the reproducible analysis
  • Step 2: curation of data for reproducible analysis
  • Step 3: robustness and automation

Requirements and Goals

Requirements

  • web browser
  • some R knowledge (not much)

Goals

  • show you enough of the toolkit to have you explore more
  • recognize (some of) the limitations
  • NOT make you a master of this today

Let's get started

The Census Bureau put out a blog post with data.

  • I attempted to replicate it
  • The replication itself should be replicable

The Context

the original page:

url

original page

We are going to focus on one figure

Original

original

Replicated

replicated

Let's start

First problem

When the replicated disappears

Consider the key inputs to this replication:

  • the original article
  • the original data
  • my article replicating the original article
  • the data for my article

Safeguarding scientific output

The role of journals is to provide a permanent record of scientific knowledge.

  • how reliable is that record?
  • where are journals stored?
  • what if the information is not in a journal?

Safeguarding scientific output

  • journals disappear, as do websites
  • paper journals are stored in libraries
  • e-journals in a system called LOCKSS = Lots of Copies Keep Stuff Safe
  • data should be stored in repositories

Solving the first snag

Building a replicable document

Why would you do this

  • lay out all the steps as “literate programming”
  • can serve as the “README”!
  • ideally runs automatically

Why would you not do this

  • in general, support for citations is weak or tricky
  • in general, not suggested when running counter to other best practices
    • becomes tricky when long-running computing is involved
    • runs counter to “short, focussed programs doing one thing” rule

Tools for a replicable document

a place to store it

  • Dropbox?
  • Github? Gitlab? Bitbucket?

a place to compute it

  • your laptop?
  • my laptop?
  • a university server?
  • a cloud server?
  • all of the above?

a programming language

  • R
  • Stata
  • Python
  • SPSS

a format for the text

  • Word?
  • \( \LaTeX \)
  • Markdown?

Tools for a replicable document

a place to store it

  • Dropbox?
  • Github! Gitlab? Bitbucket?

a place to compute it

  • your laptop?
  • my laptop?
  • a university server?
  • a cloud server!
  • all of the above!

a programming language

  • R (but don't worry!)
  • Stata
  • Python
  • SPSS

a format for the text

  • Word?
  • \( \LaTeX \)
  • Markdown!

Aside: Markdown

a format for the text

  • Word?
  • \( \LaTeX \)
  • Markdown
  • \( \overline{x} = \frac{1}{N}\sum_{i=1}^N x_i \)

Looks like this

## a format for the text
- Word?
- $\LaTeX$
- **Markdown**
- $\overline{x} = \frac{1}{N}\sum_{i=1}^N x_i$

Let's start... again

The replicable document

Getting our hands dirty

Rather than squint at code on the screen, let's … replicate my replication. Online. Now.

Rstudio.cloud

Logging on to the cloud server

Rstudio.cloud login

Rstudio.cloud workspace

While you do that

Other cloud-based compute environments:

Rstudio.cloud

  • R-focused

MyBinder.org

  • Origins with Jupyter
  • Julia, Python, and R
  • different approach

https://codeocean.com

  • Software-agnostic
    • R
    • Python
    • Stata !
    • Matlab !
    • others
  • but always scripted
  • integrated versioning of the entire compute capsule

Creating a new project

Rstudio.cloud workspace

Rstudio.cloud new project

Rstudio.cloud new project from Github

Creating a new project from Github

Rstudio.cloud new project from Github

Notes

You could have done the same thing on your laptop

  • you might not have (the same version of) Rstudio installed (free)
  • you might not have (the same version of) R installed (free)
  • you might have a Mac/Windows/Linux/old/brand-new machine

All of these are issues affecting computational reproducibility.

Cloud environments sidestep them. However, they do not solve everything…

Open the README document

A (solved) problem of dependencies

Issues of dependencies

Let me add a few things to that list:

You could have done the same thing on your laptop

  • you might not have (the same version of) Rstudio installed (free)
  • you might not have (the same version of) R installed (free)
  • you might have a Mac/Windows/Linux/old/brand-new machine
  • you might not have (the same version of) packages installed

Rstudio solves that for you

Go ahead, click on “install”

Solving dependencies

The problem is not just in R:

  • SSC or Stata Journal packages in Stata
  • libraries or compilers in Fortran
  • Modules (paid!) in SPSS or SAS
  • packages in Python (and versions of Python!)

XKCD 1987

Solving dependencies (R)

  • use packrat or checkpoint functionality
  • declare dependencies explicitly [1]
####################################
# global libraries used everywhere #
####################################
# Package lock-in (optional): install from a dated CRAN snapshot.
# Note: the MRAN snapshot service used here has since been retired (2023).
MRAN.snapshot <- "2019-01-01"
options(repos = c(CRAN = paste0("https://mran.revolutionanalytics.com/snapshot/", MRAN.snapshot)))

# Load package x, installing it first if it is not yet available
pkgTest <- function(x)
{
        if (!require(x, character.only = TRUE))
        {
                install.packages(x, dependencies = TRUE)
                if (!require(x, character.only = TRUE)) stop("Package not found: ", x)
        }
        return("OK")
}
global.libraries <- c("dplyr","devtools","rprojroot","tictoc")
results <- sapply(as.list(global.libraries), pkgTest)
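
Declaring dependencies is easier to check when the environment actually used is also written down. The following is a minimal sketch using only base R; the helper name `describe_environment` is hypothetical and not part of the original repository.

```r
# Hypothetical helper: collect the R version, the platform, and the versions
# of a set of packages into one data frame that can be saved with the output.
describe_environment <- function(packages) {
  versions <- vapply(
    packages,
    function(p) {
      # packageVersion() raises an error for packages that are not installed,
      # so report NA instead of stopping.
      tryCatch(as.character(utils::packageVersion(p)),
               error = function(e) NA_character_)
    },
    character(1)
  )
  data.frame(
    component = c("R", "platform", packages),
    version   = c(as.character(getRversion()), R.version$platform,
                  unname(versions)),
    stringsAsFactors = FALSE
  )
}

# "stats" and "utils" ship with every R installation; replace them with the
# packages the project actually declares.
env <- describe_environment(c("stats", "utils"))
print(env)
```

Writing this table next to the knitted document lets a replicator compare their own setup against the one that produced the published figures.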

Solving dependencies (Stata)

  • install packages locally [1]
  • commit as part of the repository
// Make all package paths local to the project
// Also see my related config.do at 
//   https://gist.github.com/larsvilhuber/6bcf4ff820285a1f1b9cfff2c81ca02b

local pwd "/c/path/to/project" 
capture mkdir "`pwd'/ado"

// Redirect Stata's package directories into the project
sysdir set PERSONAL "`pwd'/ado/personal"
sysdir set PLUS     "`pwd'/ado/plus"
sysdir set SITE     "`pwd'/ado/site"

/* Now install them */
/*--- SSC packages ---*/
// esttab is distributed as part of the estout package on SSC;
// "someprog" is a placeholder for your own list of packages
foreach pkg in outreg estout someprog {
  ssc install `pkg'
}

Packages installed?

Click on “Knit”

Problem solved?

Not quite

Problem solved NOW?

You should have seen a pop-up window with the compiled text

  • do the graphs look the same?
  • does the text look the same?

Success!

Question:

Are we done?

Not quite…

Important

  • how permanent is my document?
  • how permanent is the data we are using?

Useful

  • how can others easily see my latest version?

Making the document more permanent

  • we could have started on the Open Science Framework (possibly)

  • we could create a PDF and store it on Cornell's eCommons

  • we could submit to a journal!

We are going to use Zenodo

Zenodo is a general-purpose repository (not specific to the social sciences) managed by CERN

Why Zenodo?

Because it makes it really easy

  • create a hook from Zenodo to Github
  • create a release on Github
  • a permanent record remains on Zenodo with a DOI
    • even if you delete your Github repo!

For more info, see https://guides.github.com/activities/citable-code/

Zenodo page

Making the page more accessible

Creating a webpage from Github-hosted code

  • Go into the settings
  • Tick the box to make it visible
  • Ensure that you have HTML pages (“Github Pages” will not knit R Markdown for you)

settings

Having Github (and some friends) create a webpage

We can go one step further

  • Have the document be created automatically when we change and commit

Challenges

  • Code needs to be replicable!
    • all the dependencies need to be solved in our code
    • won't work for paid-for software (Stata, SPSS, SAS)
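
For an automated build, the rendering step has to run from a script rather than from the “Knit” button. A minimal sketch of such a script follows, assuming the document is an R Markdown file; the function name `build_document`, the file name `README.Rmd`, and the output folder `docs` are illustrative assumptions, not taken from the original repository.

```r
# Hypothetical build step for an automated job: render the R Markdown
# document to HTML so the published page is regenerated after each commit.
build_document <- function(input = "README.Rmd", output_dir = "docs") {
  # Fail softly so the calling job can decide what to do.
  if (!file.exists(input)) {
    message("Input not found: ", input)
    return(FALSE)
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    message("Package 'rmarkdown' is not installed")
    return(FALSE)
  }
  rmarkdown::render(input,
                    output_format = "html_document",
                    output_dir = output_dir,
                    quiet = TRUE)
  TRUE
}

# Example call with a file that does not exist: returns FALSE, renders nothing.
result <- build_document("no_such_file.Rmd")
```

Any package the document loads must be installable by the same script, which is why the dependency handling has to live in the code itself.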

How permanent is the data?

The data is obtained from a Census Bureau website.

  • The website http://www2.census.gov/ces/bds/ might be re-organized and disappear
  • The data format might change
  • The API might change
  • We only need two small chunks of code
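
One way to notice that a source file has silently changed is to record a checksum when the data is first obtained and to compare against it on every run. The sketch below uses `tools::md5sum` from the base R distribution; the helper name `verify_checksum` is hypothetical, and a small temporary file stands in for the downloaded data.

```r
# Hypothetical helper: TRUE if the file at `path` still has the MD5 checksum
# that was recorded when the data was first obtained.
verify_checksum <- function(path, expected_md5) {
  actual <- unname(tools::md5sum(path))
  identical(actual, tolower(expected_md5))
}

# A small local file stands in for the downloaded data file.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("year,job_creation", "2014,100"), demo_file)

# Checksum recorded at the time the data was archived.
recorded <- unname(tools::md5sum(demo_file))
unchanged <- verify_checksum(demo_file, recorded)

# Simulate the provider changing the file's content.
writeLines(c("year,job_creation", "2014,101"), demo_file)
changed <- verify_checksum(demo_file, recorded)
```

A failed check does not say what changed, only that the input is no longer the one the replication was built on.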

Making the data more permanent

We used Zenodo again, but all the others are just as good!

  • We uploaded manually

Using the permanent data

If we want to incorporate the Zenodo data

We could

  • make all the changes right away
  • possibly mess up the live site/ latest version of the paper?
  • maybe annoy our co-authors?

But we used a version control system with branching!

We instead

  • created a new branch, zenodo
  • made all the changes there
  • can compare the changes to the main branch
  • can consult with our co-authors before pulling the changes back into the main branch
  • keep our live site/paper valid the entire time

Compare the changes: Version Control

Since we used Github, you can compare the changes: https://github.com/larsvilhuber/jobcreationblog/compare/zenodo

We could then proceed to incorporate (pull) the changes into the main repository.

Read more about it at https://help.github.com/en/articles/about-pull-requests

Conclusion

Replication can be a lot of work

We've touched on

  • Replication per se
  • Replicable documents
  • Possible pitfalls of software dependencies
  • Cloud computing platforms
  • Permanence of source material (website, data) and how to solve it

Conclusion

We have not covered everything

… because there can be a lot more

  • High-performance computing (length, quantity, throughput)
  • Issues with commercial (paid) software (access, permanence)
  • Data that is not public-use or easily downloadable
  • Data that you need to walk into a locked room for

SafePODS

Thank you